[SEDONA-714] Add geopandas to spark arrow conversion. #1825
I'll fix the missing function issue.
Starting from Spark 4.0, we can pass the whole Arrow table to Spark.createDataFrame. I don't know when that release will be.
This is awesome! I'm new to this code base, so consider my comments optional nits 🙂
> Starting from Spark 4.0, we can pass the whole arrow table to Spark.createDataFrame
Based on this PR I'm happy to attempt backporting GeoArrow import of anything implementing __arrow_c_stream__, which would avoid materializing the GeoPandas data frame, as a follow-up 🙂
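For illustration, detecting whether an object exposes the Arrow PyCapsule stream protocol can be done with a plain attribute check. This is only a sketch: `supports_arrow_c_stream` and `FakeStreamSource` are hypothetical names, not part of this PR or of Sedona.

```python
def supports_arrow_c_stream(obj: object) -> bool:
    """Return True if obj advertises the Arrow PyCapsule stream protocol."""
    # The protocol is duck-typed: producers expose a __arrow_c_stream__ method.
    return hasattr(obj, "__arrow_c_stream__")


class FakeStreamSource:
    """Hypothetical stand-in for a data frame that can export an Arrow stream."""

    def __arrow_c_stream__(self, requested_schema=None):
        # A real producer would return a PyCapsule wrapping an ArrowArrayStream.
        raise NotImplementedError("illustration only")
```

An importer could branch on this check and fall back to the GeoPandas path when the attribute is missing.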
python/sedona/utils/geoarrow.py
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, DataType, ArrayType, MapType
import pyarrow as pa
I am not sure what the dependency situation is like for Spark, but it may be worth making this a lazy import (e.g., like in dataframe_to_arrow) so that when we import from sedona.utils.geoarrow in sedona/spark/__init__.py we don't necessarily require pyarrow to be installed (alternatively, we could add pyarrow to the apache-sedona[spark] extras to match the runtime requirement).
Good idea. I'll make all the changes later today. Thank you for the review!
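A minimal sketch of the lazy-import idea discussed above, assuming a small helper is acceptable. The helper name `import_optional` and the error wording are hypothetical, not what the PR ended up with; the point is only that the optional module is resolved when a function runs rather than when the package is imported.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str) -> ModuleType:
    """Import an optional dependency on first use, with a clearer error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint so the user knows which package is missing.
        raise ImportError(
            f"{module_name} is required for this feature but is not installed"
        ) from exc


def arrow_version() -> str:
    # Hypothetical caller: pyarrow is only imported when this function runs,
    # so importing the enclosing module does not require pyarrow.
    pa = import_optional("pyarrow")
    return pa.__version__
```

With this pattern, importing the package succeeds without pyarrow, and the failure surfaces only when an Arrow-backed function is actually called.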
    return [gen_new_name[name]() for name in names]


def _deduplicate_field_names(dt: DataType) -> DataType:
- def _deduplicate_field_names(dt: DataType) -> DataType:
+ # Backport from Spark 4.0
+ # https://github.com/apache/spark/blob/3515b207c41d78194d11933cd04bddc21f8418dd/python/pyspark/sql/pandas/types.py#L1385
+ def _deduplicate_field_names(dt: DataType) -> DataType:
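For readers unfamiliar with what field-name deduplication does, here is a stdlib-only sketch of the idea applied to a flat list of names. It is not the backported Spark code: the function name `dedup_names` and the `_0`, `_1` suffix scheme are assumptions made for illustration.

```python
import itertools
from collections import Counter
from typing import List


def dedup_names(names: List[str]) -> List[str]:
    """Suffix repeated names with a running index; leave unique names alone."""
    counts = Counter(names)
    # One independent counter per name that occurs more than once.
    counters = {name: itertools.count() for name, n in counts.items() if n > 1}
    return [
        f"{name}_{next(counters[name])}" if name in counters else name
        for name in names
    ]
```

Note that this sketch does not guard against a generated suffix colliding with a name that already exists in the input.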
@paleolimbot
Can you add documentation to this page? https://sedona.apache.org/latest/tutorial/geopandas-shapely/
Sure.
Co-authored-by: Dewey Dunnington <dewey@wherobots.com>
f328661 to 1c96da0
Apologies for the late review...this is awesome! Thank you!
Did you read the Contributor Guide?
Yes, I have read the Contributor Rules and Contributor Development Guide
No, I haven't read it.
Is this PR related to a JIRA ticket?
Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-XXX. The PR name follows the format [SEDONA-XXX] my subject.
No:
[DOCS] my subject
[CI] my subject
What changes were proposed in this PR?
How was this patch tested?
Did this PR include necessary documentation updates?
vX.Y.Z format.